PolyUHK: A Robust Information Extraction System for Web Personal Names

Authors

  • Ying Chen
  • Sophia Yat Mei Lee
  • Chu-Ren Huang
Abstract

Personal information extraction is an important component of advanced information retrieval. Two problems must be solved in this practical task: personal name ambiguity and the extraction of personal information for a specific person. For personal name ambiguity, a very common phenomenon in fast-growing Web resources, we propose a robust system that extracts features from resources beyond the given Web corpus using a fully unsupervised approach. The experiments show that these broad features not only improve performance but also increase the robustness of a disambiguation system. For personal information extraction, we introduce a rule-based information extraction system that is able to reuse current well-developed tools effectively and to identify the properties of Web data. The experiments show that the system achieves state-of-the-art performance, with especially high precision.

Similar resources

Spontaneous identification of individual nick name from web

A person is generally called by different names, which makes it difficult to identify that person on the web. Different people use different names for the same person; for example, Michael Jackson is called MJ, and some call him the "King of Pop", so searching for names on the web is not trouble-free. Accurate identification of the name of a given person is useful in various web related...

Automatically Extracting Personal Name Aliases from the Web

An entity can be referred to by multiple name aliases on the web. Extracting aliases of an entity is important for various tasks such as identification of relations among entities, automatic metadata extraction and entity disambiguation. To extract relations among entities properly, one must first identify those entities. Aliases of an entity are useful as metadata for that entity and can be used ...

Automatic Discovery of Lexical Patterns using Pattern Extraction Algorithm to Identify Personal Name Aliases with Entities

Personal name aliases are highly significant in information retrieval for retrieving complete information about a person from the web, since some web pages about the person may refer to him or her by an alias, a nickname, or the real name. People search involving personal name aliases is growing rapidly. We proposed a pattern generator which includes aut...

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Publication date: 2009